This report investigates the wineQualityReds.csv dataset, consisting of
13 variables for 1599 observations.
Quick look - column names and first observation
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.7 0 1.9 0.076
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## quality
## 1 5
structure - wineQualityReds.csv
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
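The listing above looks like the result of calling `head()` and `str()` on a data frame read from the CSV. A minimal sketch of that step follows. The object name `wine` is an assumption, and a three-row inline excerpt (values copied from the `str()` output above, with density as rounded there) stands in for the file so the snippet runs on its own.

```r
# Inline excerpt standing in for wineQualityReds.csv: the first three
# observations as printed above (density for rows 2 and 3 is the rounded
# value shown by str(), not the full-precision value in the file).
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0,2.6,0.098,25,67,0.997,3.2,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5"

# With the real file this would be: wine <- read.csv("wineQualityReds.csv")
wine <- read.csv(text = csv_text)

head(wine, 1)   # column names and first observation
str(wine)       # type and leading values of every column
```

On the full file, `str()` would report 1599 observations rather than the 3 in this excerpt.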
My primary interest is the relationship between red wine quality and
alcohol content. Also of interest is the relationship between pH
and the other 3 acidity metrics: fixed.acidity, volatile.acidity, and citric.acid.
Drill down into the individual attributes.
Red Wine quality analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 5.000 6.000 5.636 6.000 8.000
The quality attribute approximates a normal distribution.
Red Wine alcohol analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
8.40 9.50 10.20 10.42 11.10 14.90
The alcohol distribution is right skewed, log10 plot follows.
Red Wine log10 alcohol analysis.
The alcohol log10 plot approximates a normal distribution.
Red Wine pH analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.740 3.210 3.310 3.311 3.400 4.010
The pH attribute approximates a normal distribution.
Red Wine fixed acidity analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.60 7.10 7.90 8.32 9.20 15.90
The fixed acidity attribute approximates a normal distribution.
Red Wine volatile acidity analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The volatile acidity attribute approximates a normal distribution.
Red Wine citric.acid analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 0.090 0.260 0.271 0.420 1.000
Red Wine residual sugar analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.900 1.900 2.200 2.539 2.600 15.500
The residual sugar distribution is right skewed, log10 plot follows.
Red Wine log10 residual sugar analysis.
The residual sugar log10 plot approximates a normal distribution.
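The effect of the log10 transform on a right-skewed attribute can be sketched numerically. The example below uses only the first ten residual.sugar values from the `str()` output above, so it illustrates the mechanics rather than the full-data result. The helper `nonparametric_skew` is a hypothetical name for a simple diagnostic, (mean - median) / sd, and is not necessarily what the original analysis used.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical helper: a scale-free skew diagnostic.
# Positive values mean the mean sits above the median (right skew).
nonparametric_skew <- function(x) (mean(x) - median(x)) / sd(x)

# log10 compresses large values, pulling the right tail in.
log_sugar <- log10(residual_sugar)

skew_raw <- nonparametric_skew(residual_sugar)
skew_log <- nonparametric_skew(log_sugar)

summary(residual_sugar)
summary(log_sugar)
c(raw = skew_raw, log10 = skew_log)

# Histogram bin counts without drawing; drop plot = FALSE to see the plot.
h <- hist(log_sugar, plot = FALSE)
h$counts
```

With only ten values the diagnostic is noisy, so the comparison is best repeated on the full column before drawing conclusions.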
Red Wine density analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
The density attribute approximates a normal distribution.
Red Wine chlorides analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The chlorides distribution is right skewed, log10 plot follows.
Red Wine log10 chlorides analysis.
The chlorides log10 plot approximates a normal distribution.
Red Wine free sulfur dioxide analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 7.00 14.00 15.87 21.00 72.00
The free sulfur dioxide distribution is right skewed, log10 plot follows.
Red Wine log10 free sulfur dioxide analysis.
Red Wine total sulfur dioxide analysis.
Min. 1st Qu. Median Mean 3rd Qu. Max.
6.00 22.00 38.00 46.47 62.00 289.00
The total sulfur dioxide distribution is right skewed, log10 plot follows.
Red Wine log10 total sulfur dioxide analysis.
The total sulfur dioxide log10 plot approximates a normal distribution.
Red Wine sulfates analysis.
The sulfates attribute approximates a normal distribution.
The dataset sourced from wineQualityReds.csv has 1599 entries each with 13 features.
The 11 numeric (num) features of interest are:
fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol
quality is an integer (int) feature.
The quality feature range is 3 through 8.
3 represents a lower quality wine while 8 indicates a higher quality wine.
quality was analyzed as a categorical variable.
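One way to treat quality as a categorical variable in R is to convert it to an ordered factor. This is a sketch under that assumption; the report does not show how the conversion was actually done. The ten values are the first ten quality ratings from the `str()` output above.

```r
# First ten quality ratings, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Ordered factor spanning the full observed range (3 through 8), so that
# ratings missing from a subset still appear as empty levels.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Counts per rating; levels 3, 4 and 8 are empty in this ten-row excerpt.
quality_counts <- table(quality_f)
quality_counts
```

Keeping the factor ordered preserves the fact that 8 is better than 3, which matters for grouped plots and for comparisons such as `quality_f > "4"`.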
wineQualityReds.csv was downloaded from
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityReds.csv
My interest is exploring a possible relationship between alcohol and quality.
For example, does alcohol content appear to influence the subjective quality rating of the wine?
I am also interested in the 4 acidity metrics: pH, fixed.acidity, volatile.acidity, and citric.acid. I am interested in how these metrics relate to each other and how they relate to the quality metric.
I will use Bivariate Plots, Bivariate Analysis, Multivariate Plots, and
Multivariate Analysis to further my analysis of the alcohol, quality and acidity metrics.
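The following red wine characteristics natively demonstrated a normal distribution -
quality, pH, fixed.acidity, volatile.acidity, density, sulphates
The following red wine characteristics natively demonstrated a skewed distribution -
alcohol, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide
Follow up log10 plotting showed a near normal distribution for the following -
alcohol, residual.sugar, chlorides, total.sulfur.dioxide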
Initial plot of citric.acid did not reveal a recognizable distribution.
Follow up free.sulfur.dioxide log10 plot did not reveal a recognizable distribution.
Pearson correlation interpretation.
summary statistics - quality
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
summary statistics - alcohol
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
pH summary statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
volatile acidity summary statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
fixed acidity summary statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
citric acid summary statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Red Wine alcohol - quality analysis, alcohol percent by volume.
Higher quality rated wines (6, 7, and 8) contain progressively higher alcohol.
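A grouped summary makes the alcohol-by-quality comparison concrete. The sketch below computes the median alcohol per quality rating for the first ten observations from the `str()` output above. The data frame name `wine10` is an assumption, and the excerpt only covers ratings 5, 6 and 7, so it shows the technique rather than the full-data trend.

```r
# First ten observations (quality and alcohol), from the str() output above.
wine10 <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Median alcohol (% by volume) within each quality rating.
median_alcohol <- tapply(wine10$alcohol, wine10$quality, median)
median_alcohol

# The same grouping as box plot statistics; drop plot = FALSE to draw it.
bp <- boxplot(alcohol ~ quality, data = wine10, plot = FALSE)
bp$stats[3, ]   # third row holds the per-group medians
```

On the full data the same two calls, pointed at the complete data frame, would reproduce the per-rating comparison described above.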
Red Wine alcohol - quality analysis, quality percent by volume.
Quantify the strength of the quality - alcohol relationship.
Pearson’s r
Pearson's product-moment correlation
data: quality and alcohol
t = 21.639, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4373540 0.5132081
sample estimates:
cor
0.4761663
Correlation Interpretation
The quality - alcohol correlation (0.476) is greater than the accepted threshold for a meaningful correlation (0.3).
The quality - alcohol correlation is less than the accepted threshold for a moderate correlation (0.5).
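The figures in the Pearson output above can be cross-checked from the reported estimate and degrees of freedom alone. The sketch below recomputes the t statistic as r * sqrt(df) / sqrt(1 - r^2) and the 95 percent confidence interval through the Fisher z transform, which is the approach `cor.test()` takes for Pearson's r. The variable names are illustrative.

```r
r  <- 0.4761663   # sample estimate reported above
df <- 1597        # degrees of freedom reported above (n - 2)

# t statistic for testing that the true correlation is 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z transform.
z  <- atanh(r)
se <- 1 / sqrt(df - 1)                      # 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)

# On the full data frame the whole test is a single call, for example:
# cor.test(wine$quality, wine$alcohol, method = "pearson")
```

The recomputed values should agree with the reported t = 21.639 and interval 0.4373540 to 0.5132081 up to rounding of the inputs.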
Leverage the above quality - alcohol visualizations to fine tune correlation analysis.
Quantify the strength of the quality - alcohol relationship - quality > 4.
Pearson’s r
Pearson's product-moment correlation
data: quality and alcohol
t = 23.962, df = 1534, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4845165 0.5573539
sample estimates:
cor
0.5218858
Correlation Interpretation - subset quality > 4
The subset quality - alcohol correlation (0.522) is greater than the accepted threshold for a moderate correlation (0.5).
The subset quality - alcohol correlation is less than the accepted threshold for a strong correlation (0.7).
Among wines rated above 4, quality correlates more strongly with alcohol than in the full dataset (0.522 versus 0.476).
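The subset test can be sketched as a filter followed by the same Pearson test. The helper name `cor_above` and the data frame `wine10` are hypothetical. The degrees of freedom reported above also give the subset size: df = n - 2, so 1534 corresponds to 1536 wines, meaning 63 of the 1599 wines were rated 3 or 4.

```r
# Sample sizes implied by the reported degrees of freedom (df = n - 2).
n_full    <- 1597 + 2            # all wines
n_subset  <- 1534 + 2            # wines with quality > 4
n_removed <- n_full - n_subset   # wines rated 3 or 4
n_removed

# Hypothetical helper: Pearson test restricted to wines rated above a cutoff.
cor_above <- function(data, cutoff) {
  kept <- data[data$quality > cutoff, ]
  cor.test(kept$quality, kept$alcohol, method = "pearson")
}

# First ten observations from the str() output above. None is rated 4 or
# lower, so this call only demonstrates the mechanics on a tiny excerpt.
wine10 <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)
res <- cor_above(wine10, 4)
res$parameter   # degrees of freedom for the ten-row excerpt
```

Applied to the full data frame, `cor_above(wine, 4)` would be expected to reproduce the subset output shown above, assuming the subset was built with the same `quality > 4` filter.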
Alcohol and quality are positively related.
Pearson’s r - quality - volatile acidity correlation
Pearson's product-moment correlation
data: wineQualityReds$quality and wineQualityReds$volatile.acidity
t = -16.954, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.4313210 -0.3482032
sample estimates:
cor
-0.3905578
quality - volatile acidity negative correlation - meaningful but weak
Pearson’s r - quality - pH correlation
Pearson's product-moment correlation
data: wineQualityReds$quality and wineQualityReds$pH
t = -2.3109, df = 1597, p-value = 0.02096
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.106451268 -0.008734972
sample estimates:
cor
-0.05773139
The quality - pH correlation is negative, and its magnitude (0.058) is less than the accepted weak threshold (0.3).
Pearson’s r - quality - fixed acidity correlation
Pearson's product-moment correlation
data: wineQualityReds$quality and wineQualityReds$fixed.acidity
t = 4.996, df = 1597, p-value = 6.496e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.07548957 0.17202667
sample estimates:
cor
0.1240516
The quality - fixed acidity correlation (0.124) is less than the accepted weak threshold (0.3).
Pearson’s r - quality - citric.acid correlation
Pearson's product-moment correlation
data: wineQualityReds$quality and wineQualityReds$citric.acid
t = 9.2875, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1793415 0.2723711
sample estimates:
cor
0.2263725
The quality - citric acid correlation (0.226) is less than the accepted weak threshold (0.3).
Pearson’s r - pH - fixed acidity correlation
Pearson's product-moment correlation
data: wineQualityReds$pH and wineQualityReds$fixed.acidity
t = -37.366, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.7082857 -0.6559174
sample estimates:
cor
-0.6829782
pH - fixed acidity - moderate negative correlation (-0.683)
Pearson’s r - pH - citric acid correlation
Pearson's product-moment correlation
data: wineQualityReds$pH and wineQualityReds$citric.acid
t = -25.767, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5756337 -0.5063336
sample estimates:
cor
-0.5419041
pH - citric acid - moderate negative correlation (-0.542)
Pearson’s r - pH - volatile acidity correlation
Pearson's product-moment correlation
data: wineQualityReds$pH and wineQualityReds$volatile.acidity
t = 9.659, df = 1597, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.1880823 0.2807254
sample estimates:
cor
0.2349373
The pH - volatile acidity correlation (0.235) is less than the accepted weak threshold (0.3).
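The interpretation thresholds used throughout this section (0.3, 0.5 and 0.7) can be applied to all of the reported coefficients at once. The function name `interpret_r` and the label wording are assumptions chosen to mirror the text above; only the cutoffs and the r values come from the report.

```r
# Hypothetical helper: label the magnitude of r using the report's cutoffs.
# Intervals are [0, 0.3), [0.3, 0.5), [0.5, 0.7), [0.7, 1].
interpret_r <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.3, 0.5, 0.7, 1),
      labels = c("below 0.3", "meaningful but weak", "moderate", "strong"),
      right = FALSE, include.lowest = TRUE)
}

# Pearson estimates copied from the cor.test output above.
r_values <- c(
  "quality - alcohol"               =  0.4761663,
  "quality - alcohol (quality > 4)" =  0.5218858,
  "quality - volatile.acidity"      = -0.3905578,
  "quality - pH"                    = -0.05773139,
  "quality - fixed.acidity"         =  0.1240516,
  "quality - citric.acid"           =  0.2263725,
  "pH - fixed.acidity"              = -0.6829782,
  "pH - citric.acid"                = -0.5419041,
  "pH - volatile.acidity"           =  0.2349373
)

correlation_table <- data.frame(r = r_values, strength = interpret_r(r_values))
correlation_table
```

Labeling on the absolute value keeps the sign separate from the strength, which matches how the negative pH correlations are described above.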
red wine quality - citric acid
Reminder - pH and acidity are inversely related: the lower the pH, the more acidic the wine.
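The inverse relationship is logarithmic rather than linear. By definition pH = -log10 of the hydrogen ion activity, so each one-unit drop in pH corresponds to a tenfold increase in that activity. A small sketch using the minimum and maximum pH from the summary statistics above (the function name `h_activity` is illustrative):

```r
# Hydrogen ion activity implied by a pH value (pH = -log10(activity)).
h_activity <- function(ph) 10^(-ph)

# One pH unit is a factor of ten in activity.
one_unit_ratio <- h_activity(3) / h_activity(4)

# Most acidic (pH 2.74) versus least acidic (pH 4.01) wine in the dataset.
range_ratio <- h_activity(2.74) / h_activity(4.01)

one_unit_ratio
range_ratio   # 10^1.27, about 18.6
```

So the pH range in this dataset, which looks narrow on the pH scale, spans a difference of more than an order of magnitude in hydrogen ion activity.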
For most quality ratings, alcohol increase is accompanied by pH increase (lower acidity).
Red wine pH, quality follow up analysis.
Quality tends to increase as pH increases (lower acidity), although the overall quality - pH correlation reported above is negligible (-0.058).
Quality tends to increase as alcohol percent by volume increases.
No comprehensive, definitive pH trend is illustrated.
No discernible color gradient pattern or striations were observed.
No comprehensive, definitive Fixed Acidity trend is illustrated.
No discernible color gradient pattern or striations were observed.
No comprehensive, definitive Volatile Acidity trend is illustrated.
No discernible color gradient pattern or striations were observed.
No comprehensive, definitive Citric Acid trend is illustrated.
No discernible color gradient pattern or striations were observed.
For most quality ratings, alcohol increase is accompanied by pH increase (lower acidity).
No comprehensive, explicit relationship was seen when pH was mapped to color
on the quality - alcohol plot.
No comprehensive, explicit relationship was seen when Fixed Acidity was
mapped to color on the quality - alcohol plot.
No comprehensive, explicit relationship was seen when Volatile Acidity
was mapped to color on the quality - alcohol plot.
No comprehensive, explicit relationship was seen when Citric Acid was
mapped to color on the quality - alcohol plot.
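A color-mapped quality - alcohol plot of the kind described above can be sketched with base R graphics. This is an assumption about the general technique, not a reproduction of the original plots, which may well have been built with a plotting package instead. The ten rows come from the `str()` output above, and `wine10`, `shades` and `point_col` are illustrative names.

```r
# First ten observations (quality, alcohol, pH), from the str() output above.
wine10 <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH      = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Map pH onto a 100-step gradient: blue = lowest pH, red = highest pH.
shades    <- colorRampPalette(c("blue", "red"))(100)
ph_scaled <- (wine10$pH - min(wine10$pH)) / diff(range(wine10$pH))
point_col <- shades[1 + round(ph_scaled * 99)]

# Draw to a null device so nothing is written to disk; remove the
# pdf(NULL) and dev.off() lines to see the plot interactively.
pdf(NULL)
plot(jitter(wine10$quality), wine10$alcohol,
     col = point_col, pch = 19,
     xlab = "quality", ylab = "alcohol (% by volume)")
invisible(dev.off())
```

Swapping `wine10$pH` for another acidity column gives the corresponding plot for fixed, volatile or citric acid. A visible band of one color at one end of either axis would be the kind of gradient pattern that the commentary above reports as absent.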
For most quality ratings, alcohol increase is accompanied by pH increase (lower acidity).
I chose this plot because it communicates the relationship between alcohol content and pH,
broken out by quality rating.
Pearson’s r mathematically augments the visualizations.
I chose this plot because it communicates the positive relationship between alcohol content and quality rating (r = 0.476 overall, 0.522 for wines rated above 4).
Per the Centers for Disease Control and Prevention -
“Alcohol use slows reaction time and impairs judgment…”
The CDC’s acknowledgement of the effect alcohol has on judgment suggests that further analysis is warranted.
Does alcohol content affect the reporting of the subjective quality rating?
Further analysis would include a controlled experiment where alcohol concentration is the only variable.
The results would be used to further analyze and refine the relationship between alcohol concentration
and the subjective quality rating.
I chose this plot because of the observed relationship between Volatile Acidity and pH.
The plot suggests a rise in pH (less acidic measurements) as Volatile Acidity increases.
Initially one might assume that if a measure of acidity such as Volatile Acidity increased,
there would be a corresponding decrease in pH, that is, a more acidic measurement.
I chose this plot because it illustrates the relationship between pH, alcohol content, and quality:
Higher quality wines tend to have pH values near the middle of the range and higher alcohol content.
Additional studies - density - alcohol - quality analysis.
I chose this plot because it illustrates the relationship between density, alcohol content, and quality:
Higher quality wines tend to have lower density and higher alcohol content.
What was surprising?
The unexpected relationship between Volatile Acidity and pH.
As Volatile Acidity increased, pH increased. pH is a measure of acidity.
It was unexpected to see an increase in Volatile Acidity accompanied by a higher, less acidic pH reading.
Future work.
Further investigate the relationship between alcohol content and subjective quality rating.
Next step - Additional quality rating analysis based on a controlled experiment where alcohol concentration
is the only variable.